Is Open Source the Future of AI? A Data-Driven Approach

Vake, Domen, Šinik, Bogdan, Vičič, Jernej, Tošić, Aleksandar

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have become central in academia and industry, raising concerns about privacy, transparency, and misuse. A key issue is the trustworthiness of proprietary models, with open-sourcing often proposed as a solution. However, open-sourcing presents its own challenges, including potential misuse, financial disincentives, and intellectual property concerns. Proprietary models, backed by private-sector resources, are better positioned for return on investment. Other approaches lie between the fully open-source and fully proprietary extremes. These can largely be categorised into open-source models whose usage is limited by licensing, partially open-source (open-weights) models, and hybrid approaches in which obsolete model versions are open-sourced while competitive versions with market value remain proprietary. Currently, the discussion of where on this spectrum future models should fall remains unbacked by data and largely opinionated, with industry leaders weighing in. In this paper, we present a data-driven approach, compiling data on open-source development of LLMs and their contributions in terms of improvements, modifications, and methods. Our goal is not to support either extreme, but to present data that can inform future discussions by industry experts as well as policy makers. Our findings indicate that open-source contributions can enhance model performance, with trends such as reduced model size and manageable accuracy loss. We also identify positive community-engagement patterns and the architectures that benefit most from open contributions.


VITA: Towards Open-Source Interactive Omni Multimodal LLM

Fu, Chaoyou, Lin, Haojia, Long, Zuwei, Shen, Yunhang, Zhao, Meng, Zhang, Yifan, Dong, Shaoqi, Wang, Xiong, Yin, Di, Ma, Long, Zheng, Xiawu, He, Ran, Ji, Rongrong, Wu, Yunsheng, Shan, Caifeng, Sun, Xing

arXiv.org Artificial Intelligence

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and then apply bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. VITA is a first step for the open-source community toward the seamless integration of multimodal understanding and interaction. While much work remains for VITA to approach its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.


Exclusive: Renowned Experts Pen Support for California's Landmark AI Safety Bill

TIME - Tech

On August 7, a group of renowned professors co-authored a letter urging key lawmakers to support a California AI bill as it enters the final stages of the state's legislative process. In a letter shared exclusively with TIME, Yoshua Bengio, Geoffrey Hinton, Lawrence Lessig, and Stuart Russell argue that the next generation of AI systems pose "severe risks" if "developed without sufficient care and oversight," and describe the bill as the "bare minimum for effective regulation of this technology." The bill, titled the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act, was introduced by Senator Scott Wiener in February of this year. It requires AI companies training large-scale models to conduct rigorous safety testing for potentially dangerous capabilities and implement comprehensive safety measures to mitigate risks. "There are fewer regulations on AI systems that could pose catastrophic risks than on sandwich shops or hairdressers," the four experts write.


Arcee's MergeKit: A Toolkit for Merging Large Language Models

Goddard, Charles, Siriwardhana, Shamane, Ehghaghi, Malikeh, Meyers, Luke, Karpukhin, Vlad, Benedict, Brian, McQuade, Mark, Solawetz, Jacob

arXiv.org Artificial Intelligence

The rapid expansion of the open-source language model landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, have resulted in a vast number of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI, including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the world's most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
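MergeKit merges are declared in YAML configuration files. A minimal sketch of a SLERP merge of two checkpoints, following the schema documented in the MergeKit repository (the model names, layer ranges, and interpolation value below are illustrative, not from this abstract):

```yaml
# Spherically interpolate (SLERP) between two fine-tunes of the same base.
slices:
  - sources:
      - model: org-a/model-a          # hypothetical checkpoint names
        layer_range: [0, 32]
      - model: org-b/model-b
        layer_range: [0, 32]
merge_method: slerp
base_model: org-a/model-a
parameters:
  t: 0.5          # interpolation factor: 0 = model-a, 1 = model-b
dtype: float16
```

Such a file is typically passed to the library's command-line entry point along with an output directory, producing a merged checkpoint without any training.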


The Case for Universal Basic Computing Power

Zhu, Yue

arXiv.org Artificial Intelligence

The Universal Basic Computing Power (UBCP) initiative would ensure global, free access to a set amount of computing power specifically for AI research and development (R&D). The initiative comprises three key elements. First, UBCP must be cost-free, with its usage limited to AI R&D under minimal additional conditions. Second, UBCP should continually incorporate state-of-the-art AI advancements, including efficiently distilled, compressed, and deployed training data, foundational models, benchmarks, and governance tools. Lastly, it is essential for UBCP to be universally accessible, ensuring convenience for all users. We urge the major stakeholders in AI development (large platforms, open-source contributors, and policymakers) to prioritize the UBCP initiative.


The Leak That Has Big Tech and Regulators Panicked

Slate

In February, Meta released its large language model: LLaMA. Unlike OpenAI and its ChatGPT, Meta didn't just give the world a chat window to play with. Instead, it released the code into the open-source community, and shortly thereafter the model itself was leaked. Researchers and programmers immediately started modifying it, improving it, and getting it to do things no one else anticipated. And their results have been immediate, innovative, and an indication of how the future of this technology is going to play out.


PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods

#artificialintelligence

To our knowledge, PMLB is the largest publicly available collection of curated, ready-to-use ML benchmark datasets for classification and regression. Competing ML dataset collections--such as the UCI Machine Learning Repository (Dua and Graff, 2017) or Kaggle Datasets--tend to contain a mixture of classification, regression, and other datasets, with varying degrees of documentation and preprocessing and often inadequately characterized measures of data quality. Other, smaller collections of datasets--like Scikit-Learn's datasets module (Pedregosa et al., 2011)--can be well documented and curated, but lack the breadth and scope of PMLB. PMLB aims to balance this tradeoff, a task we approach through a combination of crowdsourcing datasets, automating the assessment of data quality, and utilizing appropriate third-party tools, such as GitHub's continuous integration features, Pandas Profiling, and Git Large File Storage, as described in the following text. PMLB consists of three main components: (i) the collection of benchmark datasets, including metadata and associated documentation; (ii) a Python interface for easily accessing the datasets in the PMLB collection; and (iii) an R interface providing similar functionality to the Python interface.


Nebullvm, an open-source library to accelerate AI inference by 5–20x in a few lines of code

#artificialintelligence

It takes your AI model as input and outputs an optimized version that runs 5–20 times faster on your hardware. In other words, nebullvm tests multiple deep learning compilers to identify the best…


Europe's shot for Artificial General Intelligence 🇪🇺🤖 -- Why we invested in Aleph Alpha

#artificialintelligence

We are excited to have co-led the €23 million Series A financing round of Heidelberg-based Aleph Alpha together with our friends at Lakestar, UVC, and existing investors LEA Partners, 468 Capital, and Cavalry Ventures. The exceptional team around AI serial entrepreneur Jonas Andrulis and co-founder Samuel Weinbach researches, develops, and operationalises a new generation of huge and powerful AI systems like GPT-3, DALL-E, or MuZero to maintain European sovereignty in Artificial General Intelligence (AGI). Andrew Ng, one of the leading AI experts, stated in 2017: "AI is the new electricity" -- but is it really? Today, narrow AI -- models trained to perform one very specific task, like playing chess or solving equations, at or above human level -- has automated the development of Covid-19 vaccines, autonomous driving, the creation of music, perfumes, and a lot more. Yes, AI has led to mind-boggling results, but to fully gauge its potential, I'd like to put things into context: if humanity had existed for the equivalent of 1 day (representing 300k years) and electricity for 1 minute (representing 200 years), then narrow AI would have been around for about 5 seconds (representing less than two decades).


Council Post: Autonomous Advertising: Mapping The Future Of Machine Learning In Ad Tech

#artificialintelligence

The auto industry is on a mission to improve the experience and value of vehicles by making transport autonomous, so owners can simply input a destination and have the vehicle drive itself there. Reaching this goal requires complex machine learning models that account for the countless variables a vehicle may encounter on the road, even as it performs tasks a human would find trivial. So, what do autonomous vehicles have in common with digital advertising? Programmatic advertising technology companies are also on a mission to improve the user experience and the value derived from their platforms. And the serious contenders are using machine learning to get there.